Communities in LA County: What Do We Know?

R25 Modelers and Story Tellers

Author

Drs. Hua Zhou and Roch Nianogo

Published

July 21, 2023

1 Introduction

In this tutorial, we learn to:

  • Visualize Census data using ggplot and other plotting packages.

Resources: Analyzing US Census Data Chapter 4 and R for Data Science.

2 Basic visualization

  • Let’s start with the data on median household income and median age by county in the state of California from the 2016-2020 ACS.
library(tidycensus)

ca_wide <- get_acs(
  geography = "county",
  state = "CA",
  variables = c(
    medinc = "B19013_001", # median household income
    medage = "B01002_001"  # median age
    ),
  output = "wide",
  year = 2020
) %>%
  print()
# A tibble: 58 × 6
   GEOID NAME                            medincE medincM medageE medageM
   <chr> <chr>                             <dbl>   <dbl>   <dbl>   <dbl>
 1 06001 Alameda County, California       104888    1041    37.8     0.2
 2 06003 Alpine County, California         85750   23875    47.6     9.5
 3 06007 Butte County, California          54972    2501    36.9     0.2
 4 06011 Colusa County, California         59427    5384    35.4     0.3
 5 06013 Contra Costa County, California  103997    1239    39.9     0.1
 6 06017 El Dorado County, California      83710    2228    46.3     0.2
 7 06019 Fresno County, California         57109     929    32.4     0.2
 8 06023 Humboldt County, California       49235    2278    38.9     0.4
 9 06025 Imperial County, California       46222    2499    32.5     0.3
10 06029 Kern County, California           54851    1087    31.9     0.2
# … with 48 more rows
  • Our first ggplot figure is a histogram of median income:
library(tidyverse)
# avoid using scientific notation
options(scipen = 999)

ggplot(ca_wide) + 
  geom_histogram(aes(x = medincE))

Exercise. Display the documentation for geom_histogram().

Exercise. By default, geom_histogram uses 30 bins. Change the bins argument to other values.

ca_wide <- get_acs(
  geography = "county",
  state = "CA",
  variables = c(
    medinc = "B19013_001",
    medage = "B01002_001"
    ),
  output = "wide",
  year = 2020
)

ggplot(ca_wide) + 
  geom_histogram(aes(x = medincE), bins = 30)

Exercise. Display the histogram of the median ages.

  • Boxplot of the median household income:
ggplot(ca_wide) + 
  geom_boxplot(aes(y = medincE))

Exercise. Display the boxplot of the median ages.

  • Plot the relationship between two variables by a scatterplot:
ggplot(ca_wide) + 
  geom_point(aes(x = medageE, y = medincE))

  • Overlay with a linear model fit. Note that both geom_point() and geom_smooth layers inherit the mapping aes(x = medageE, y = medincE).
ggplot(ca_wide, aes(x = medageE, y = medincE)) + 
  geom_point() + 
  geom_smooth(method = "lm")

  • Overlay with a non-linear model fit:
ggplot(ca_wide, aes(x = medageE, y = medincE)) + 
  geom_point() + 
  geom_smooth(method = "loess")

3 Customizing ggplot visualization

  • Suppose we want to illustrate the percent of commuters that take public transportation to work for the largest metropolitan areas in the United States, using data from the 2019 1-year ACS.
library(tidycensus)
library(tidyverse)

metros <-  get_acs(
  geography = "cbsa",         # metropolitan statistical area
  variables = "DP03_0021P",   # percentage of public transportation commuters
  summary_var = "B01003_001", # total population
  survey = "acs1",
  year = 2019
) %>%
  # 10 most populous CBSAs
  slice_max(summary_est, n = 20) %>%
  print()
# A tibble: 20 × 7
   GEOID NAME                              varia…¹ estim…²   moe summa…³ summa…⁴
   <chr> <chr>                             <chr>     <dbl> <dbl>   <dbl>   <dbl>
 1 35620 New York-Newark-Jersey City, NY-… DP03_0…    31.6   0.2  1.92e7      NA
 2 31080 Los Angeles-Long Beach-Anaheim, … DP03_0…     4.8   0.1  1.32e7      NA
 3 16980 Chicago-Naperville-Elgin, IL-IN-… DP03_0…    12.4   0.3  9.46e6    1469
 4 19100 Dallas-Fort Worth-Arlington, TX … DP03_0…     1.3   0.1  7.57e6      NA
 5 26420 Houston-The Woodlands-Sugar Land… DP03_0…     2     0.2  7.07e6      NA
 6 47900 Washington-Arlington-Alexandria,… DP03_0…    13.1   0.4  6.28e6    2482
 7 33100 Miami-Fort Lauderdale-Pompano Be… DP03_0…     2.9   0.2  6.17e6      NA
 8 37980 Philadelphia-Camden-Wilmington, … DP03_0…     9.4   0.3  6.10e6      NA
 9 12060 Atlanta-Sandy Springs-Alpharetta… DP03_0…     2.8   0.2  6.02e6    3340
10 38060 Phoenix-Mesa-Chandler, AZ Metro … DP03_0…     1.8   0.2  4.95e6      NA
11 14460 Boston-Cambridge-Newton, MA-NH M… DP03_0…    13.4   0.4  4.87e6      NA
12 41860 San Francisco-Oakland-Berkeley, … DP03_0…    18.9   0.4  4.73e6      NA
13 40140 Riverside-San Bernardino-Ontario… DP03_0…     1.1   0.2  4.65e6      NA
14 19820 Detroit-Warren-Dearborn, MI Metr… DP03_0…     1.4   0.2  4.32e6      NA
15 42660 Seattle-Tacoma-Bellevue, WA Metr… DP03_0…    10.7   0.4  3.98e6      NA
16 33460 Minneapolis-St. Paul-Bloomington… DP03_0…     4.5   0.2  3.64e6      NA
17 41740 San Diego-Chula Vista-Carlsbad, … DP03_0…     2.8   0.3  3.34e6      NA
18 45300 Tampa-St. Petersburg-Clearwater,… DP03_0…     1.1   0.2  3.19e6      NA
19 19740 Denver-Aurora-Lakewood, CO Metro… DP03_0…     4.5   0.3  2.97e6      NA
20 41180 St. Louis, MO-IL Metro Area       DP03_0…     1.8   0.2  2.80e6    1618
# … with abbreviated variable names ¹​variable, ²​estimate, ³​summary_est,
#   ⁴​summary_moe
  • Bar chart:
ggplot(metros) + 
  geom_col(aes(x = NAME, y = estimate))

  • Shorten metro names, reorder by the percentage (estimat), rotate the bar chart, and more meaningful title and axis labels:
metros %>%
  mutate(NAME = str_remove(NAME, "-.*$")) %>%
  mutate(NAME = str_remove(NAME, ",.*$")) %>%
  ggplot(aes(y = reorder(NAME, estimate), x = estimate)) + 
  geom_col() + 
  theme_minimal() + 
  labs(title = "Public transit commute share", 
       subtitle = "2019 1-year ACS estimates", 
       y = "", 
       x = "ACS estimate", 
       caption = "Source: ACS Data Profile variable DP03_0021P") 

  • More styles:
library(scales)

metros %>%
  mutate(NAME = str_remove(NAME, "-.*$")) %>%
  mutate(NAME = str_remove(NAME, ",.*$")) %>%
  ggplot(aes(y = reorder(NAME, estimate), x = estimate)) + 
  geom_col(color = "navy", fill = "navy", 
           alpha = 0.5, width = 0.85) +  
  theme_minimal(base_size = 12, base_family = "Verdana") + 
  scale_x_continuous(labels = label_percent(scale = 1)) + 
  labs(title = "Public transit commute share", 
       subtitle = "2019 1-year ACS estimates", 
       y = "", 
       x = "ACS estimate", 
       caption = "Source: ACS Data Profile variable DP03_0021P") 

4 Visualizing margins of error

  • Let’s visualize the median household incomes of counties in the state of California from the 2016-2020 ACS.

  • California counties and population sizes:

ca <- get_decennial(
  state = "CA",
  geography = "county",
  variables = c(totalpop = "P1_001N"),
  year = 2020
  ) %>%
  arrange(desc(value)) %>%
  print()
# A tibble: 58 × 4
   GEOID NAME                              variable    value
   <chr> <chr>                             <chr>       <dbl>
 1 06037 Los Angeles County, California    totalpop 10014009
 2 06073 San Diego County, California      totalpop  3298634
 3 06059 Orange County, California         totalpop  3186989
 4 06065 Riverside County, California      totalpop  2418185
 5 06071 San Bernardino County, California totalpop  2181654
 6 06085 Santa Clara County, California    totalpop  1936259
 7 06001 Alameda County, California        totalpop  1682353
 8 06067 Sacramento County, California     totalpop  1585055
 9 06013 Contra Costa County, California   totalpop  1165927
10 06019 Fresno County, California         totalpop  1008654
# … with 48 more rows
  • Median household incomes:
ca_income <- get_acs(
  state = "CA",
  geography = "county",
  variables = c(hhincome = "B19013_001"),
  year = 2020
) %>%
  mutate(NAME = str_remove(NAME, " County, California")) %>%
  print()
# A tibble: 58 × 5
   GEOID NAME         variable estimate   moe
   <chr> <chr>        <chr>       <dbl> <dbl>
 1 06001 Alameda      hhincome   104888  1041
 2 06003 Alpine       hhincome    85750 23875
 3 06005 Amador       hhincome    65187  4225
 4 06007 Butte        hhincome    54972  2501
 5 06009 Calaveras    hhincome    67054  4535
 6 06011 Colusa       hhincome    59427  5384
 7 06013 Contra Costa hhincome   103997  1239
 8 06015 Del Norte    hhincome    49981  5145
 9 06017 El Dorado    hhincome    83710  2228
10 06019 Fresno       hhincome    57109   929
# … with 48 more rows
  • Basic plot:
ggplot(ca_income, aes(x = estimate, y = reorder(NAME, estimate))) + 
  geom_point(size = 1.5, color = "darkgreen") + 
  labs(title = "Median household income", 
       subtitle = "Counties in California", 
       x = "", 
       y = "ACS estimate") + 
  theme_minimal(base_size = 9) + 
  scale_x_continuous(labels = label_dollar()) + 
  scale_y_discrete(guide = guide_axis(n.dodge = 2))

  • Add error bar (from margin of errors):
ggplot(ca_income, aes(x = estimate, y = reorder(NAME, estimate))) + 
  geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) + 
  geom_point(size = 1.5, color = "darkgreen") + 
  theme_minimal(base_size = 8) + 
  labs(title = "Median household income", 
       subtitle = "Counties in California", 
       x = "2016-2020 ACS estimate", 
       y = "") + 
  scale_x_continuous(labels = label_dollar()) + 
  scale_y_discrete(guide = guide_axis(n.dodge = 2))

What do we observe?

ca %>% arrange(value)
# A tibble: 58 × 4
   GEOID NAME                         variable value
   <chr> <chr>                        <chr>    <dbl>
 1 06003 Alpine County, California    totalpop  1204
 2 06091 Sierra County, California    totalpop  3236
 3 06049 Modoc County, California     totalpop  8700
 4 06051 Mono County, California      totalpop 13195
 5 06105 Trinity County, California   totalpop 16112
 6 06043 Mariposa County, California  totalpop 17131
 7 06027 Inyo County, California      totalpop 19016
 8 06063 Plumas County, California    totalpop 19790
 9 06011 Colusa County, California    totalpop 21839
10 06015 Del Norte County, California totalpop 27743
# … with 48 more rows

5 Visualize ACS estimates over time

  • Let’s obtain 1-year ACS data from 2005 through 2021 (no 2020 ACS data) on median home value for Deschutes County, Oregon, home to the city of Bend and large numbers of in-migrants in recent years from the Bay Area in California.
years <- c(2005:2019, 2021)
names(years) <- years

deschutes_value <- map_dfr(years, ~{
  get_acs(
    geography = "county",
    variables = "B25077_001", # median home value
    state = "OR",
    county = "Deschutes",
    year = .x,
    survey = "acs1"
  )
}, .id = "year") %>%
  print()
# A tibble: 16 × 6
   year  GEOID NAME                     variable   estimate   moe
   <chr> <chr> <chr>                    <chr>         <dbl> <dbl>
 1 2005  41017 Deschutes County, Oregon B25077_001   236100 13444
 2 2006  41017 Deschutes County, Oregon B25077_001   336600 11101
 3 2007  41017 Deschutes County, Oregon B25077_001   356700 16765
 4 2008  41017 Deschutes County, Oregon B25077_001   331600 17104
 5 2009  41017 Deschutes County, Oregon B25077_001   284300 12652
 6 2010  41017 Deschutes County, Oregon B25077_001   260700 18197
 7 2011  41017 Deschutes County, Oregon B25077_001   216200 18065
 8 2012  41017 Deschutes County, Oregon B25077_001   235300 19016
 9 2013  41017 Deschutes County, Oregon B25077_001   240000 16955
10 2014  41017 Deschutes County, Oregon B25077_001   257200 18488
11 2015  41017 Deschutes County, Oregon B25077_001   312600 18751
12 2016  41017 Deschutes County, Oregon B25077_001   317500 17479
13 2017  41017 Deschutes County, Oregon B25077_001   368600 16477
14 2018  41017 Deschutes County, Oregon B25077_001   408500 16808
15 2019  41017 Deschutes County, Oregon B25077_001   413500 19506
16 2021  41017 Deschutes County, Oregon B25077_001   543900 30518
  • A stylized plot:
ggplot(deschutes_value, aes(x = year, y = estimate, group = 1)) + 
  geom_ribbon(
    aes(ymax = estimate + moe, ymin = estimate - moe),
    fill = "navy",
    alpha = 0.4
    ) +
  geom_line(color = "navy") + 
  geom_point(color = "navy", size = 2) + 
  theme_minimal(base_size = 12) + 
  scale_y_continuous(labels = label_dollar(scale = .001, suffix = "k")) + 
  labs(
    title = "Median home value in Deschutes County, OR",
    x = "Year",
    y = "ACS estimate",
    caption = "Shaded area represents margin of error around the ACS estimate"
    )

6 Exploring age and sex structure with population pyramids

  • We use data from the Population Estimates API for the state of California:
ca <- get_estimates(
  geography = "state",
  state = "CA",
  product = "characteristics",
  breakdown = c("SEX", "AGEGROUP"),
  breakdown_labels = TRUE,
  year = 2019
  ) %>%
  print()
# A tibble: 96 × 5
   GEOID NAME          value SEX        AGEGROUP          
   <chr> <chr>         <dbl> <chr>      <fct>             
 1 06    California 39512223 Both sexes All ages          
 2 06    California  2383716 Both sexes Age 0 to 4 years  
 3 06    California  2478850 Both sexes Age 5 to 9 years  
 4 06    California  2529137 Both sexes Age 10 to 14 years
 5 06    California  2533694 Both sexes Age 15 to 19 years
 6 06    California  2647279 Both sexes Age 20 to 24 years
 7 06    California  3094978 Both sexes Age 25 to 29 years
 8 06    California  2957974 Both sexes Age 30 to 34 years
 9 06    California  2780574 Both sexes Age 35 to 39 years
10 06    California  2501526 Both sexes Age 40 to 44 years
# … with 86 more rows
  • Remove rows for Both sexes and flip values of Male to negative.
ca_filtered <- ca %>%
  filter(
    str_detect(AGEGROUP, "^Age"), 
    SEX != "Both sexes"
    ) %>%
  mutate(value = ifelse(SEX == "Male", -value, value)) %>%
  print()
# A tibble: 36 × 5
   GEOID NAME          value SEX   AGEGROUP          
   <chr> <chr>         <dbl> <chr> <fct>             
 1 06    California -1220085 Male  Age 0 to 4 years  
 2 06    California -1266874 Male  Age 5 to 9 years  
 3 06    California -1292852 Male  Age 10 to 14 years
 4 06    California -1292178 Male  Age 15 to 19 years
 5 06    California -1359751 Male  Age 20 to 24 years
 6 06    California -1600681 Male  Age 25 to 29 years
 7 06    California -1525571 Male  Age 30 to 34 years
 8 06    California -1416586 Male  Age 35 to 39 years
 9 06    California -1255993 Male  Age 40 to 44 years
10 06    California -1252341 Male  Age 45 to 49 years
# … with 26 more rows
  • Pyramid plot:
ggplot(ca_filtered) + 
  geom_col(aes(x = value, y = AGEGROUP, fill = SEX))

  • A more stylized pyramid:
ca_pyramid <- ggplot(ca_filtered, 
                       aes(x = value, 
                           y = AGEGROUP, 
                           fill = SEX)) + 
  geom_col(width = 0.95, alpha = 0.75) + 
  theme_minimal(base_family = "Verdana", 
                base_size = 12) + 
  scale_x_continuous(
    labels = ~ number_format(scale = .001, suffix = "k")(abs(.x)),
    limits = 1000000 * c(-1.75, 1.75)
  ) + 
  scale_y_discrete(labels = ~ str_remove_all(.x, "Age\\s|\\syears")) + 
  scale_fill_manual(values = c("darkred", "navy")) + 
  labs(x = "", 
       y = "2019 Census Bureau population estimate", 
       title = "Population structure in California", 
       fill = "", 
       caption = "Data source: US Census Bureau population estimates & tidycensus R package")

ca_pyramid

  • With aid of the geofacet package, we can visualize pyramids of multiple states:
library(geofacet)

us_pyramid_data <- get_estimates(
  geography = "state",
  product = "characteristics",
  breakdown = c("SEX", "AGEGROUP"),
  breakdown_labels = TRUE,
  year = 2019
) %>%
  filter(str_detect(AGEGROUP, "^Age"),
         SEX != "Both sexes") %>%
  group_by(NAME) %>%
  mutate(prop = value / sum(value, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(prop = ifelse(SEX == "Male", -prop, prop)) %>%
  print()
# A tibble: 1,872 × 6
   GEOID NAME         value SEX   AGEGROUP              prop
   <chr> <chr>        <dbl> <chr> <fct>                <dbl>
 1 28    Mississippi  93187 Male  Age 0 to 4 years   -0.0313
 2 28    Mississippi  96334 Male  Age 5 to 9 years   -0.0324
 3 28    Mississippi 105273 Male  Age 10 to 14 years -0.0354
 4 28    Mississippi 102782 Male  Age 15 to 19 years -0.0345
 5 28    Mississippi 102207 Male  Age 20 to 24 years -0.0343
 6 28    Mississippi 104027 Male  Age 25 to 29 years -0.0350
 7 28    Mississippi  91553 Male  Age 30 to 34 years -0.0308
 8 28    Mississippi  92312 Male  Age 35 to 39 years -0.0310
 9 28    Mississippi  85188 Male  Age 40 to 44 years -0.0286
10 28    Mississippi  87453 Male  Age 45 to 49 years -0.0294
# … with 1,862 more rows
ggplot(us_pyramid_data, aes(x = prop, y = AGEGROUP, fill = SEX)) + 
  geom_col(width = 1) + 
  theme_minimal() + 
  scale_fill_manual(values = c("darkred", "navy")) + 
  facet_geo(~NAME, grid = "us_state_with_DC_PR_grid2",
            label = "code") + 
  theme(axis.text = element_blank(),
        strip.text.x = element_text(size = 8)) + 
  labs(x = "", 
       y = "", 
       title = "Population structure by age and sex", 
       fill = "", 
       caption = "Data source: US Census Bureau population estimates & tidycensus R package")

7 Visualizing group-wise comparison

Southern California housing

housing_val <- get_acs(
  geography = "tract", 
  variables = "B25077_001", 
  state = "CA", 
  county = c(
    "Imperial",
    "Los Angeles",
    "Orange", 
    "San Diego", 
    "Santa Barbara",
    "Ventura"
  ),
  year = 2020
) %>% 
  separate(
    NAME,
    into = c("tract", "county", "state"),
    sep = ", "
    ) %>%
  group_by(county) %>%
  print()
# A tibble: 4,188 × 7
# Groups:   county [6]
   GEOID       tract               county          state   varia…¹ estim…²   moe
   <chr>       <chr>               <chr>           <chr>   <chr>     <dbl> <dbl>
 1 06025010101 Census Tract 101.01 Imperial County Califo… B25077…      NA    NA
 2 06025010102 Census Tract 101.02 Imperial County Califo… B25077…  144000 11653
 3 06025010200 Census Tract 102    Imperial County Califo… B25077…  157800 21388
 4 06025010300 Census Tract 103    Imperial County Califo… B25077…  295300 52583
 5 06025010401 Census Tract 104.01 Imperial County Califo… B25077…  138300 61660
 6 06025010402 Census Tract 104.02 Imperial County Califo… B25077…  147200 23108
 7 06025010500 Census Tract 105    Imperial County Califo… B25077…  230100 14814
 8 06025010600 Census Tract 106    Imperial County Califo… B25077…  257000 23904
 9 06025010700 Census Tract 107    Imperial County Califo… B25077…  163100 25943
10 06025010800 Census Tract 108    Imperial County Califo… B25077…  285100 32973
# … with 4,178 more rows, and abbreviated variable names ¹​variable, ²​estimate

Summary statistics:

housing_val %>%
  summarize(
    min = min(estimate, na.rm = TRUE), 
    mean = mean(estimate, na.rm = TRUE), 
    median = median(estimate, na.rm = TRUE), 
    max = max(estimate, na.rm = TRUE)
    ) %>%
  arrange(desc(median)) %>%
  print()
# A tibble: 6 × 5
  county                  min    mean median     max
  <chr>                 <dbl>   <dbl>  <dbl>   <dbl>
1 Orange County         59600 744923. 658600 2000001
2 Santa Barbara County 197600 735344. 638050 2000001
3 Los Angeles County    45200 674727. 576800 2000001
4 Ventura County       129400 610242. 573400 2000001
5 San Diego County       9999 631103. 561000 2000001
6 Imperial County       17300 189126. 189600  302300

Density plots:

ggplot(housing_val, aes(x = estimate, fill = county)) + 
  geom_density(alpha = 0.3)
Warning: Removed 201 rows containing non-finite values (`stat_density()`).

Facet plots:

ggplot(housing_val, aes(x = estimate)) +
  geom_density(fill = "darkgreen", color = "darkgreen", alpha = 0.5) + 
  facet_wrap(~county) + 
  scale_x_continuous(labels = dollar_format(scale = 0.000001, 
                                            suffix = "m")) + 
  theme_minimal(base_size = 14) + 
  theme(axis.text.y = element_blank(), 
        axis.text.x = element_text(angle = 45)) + 
  labs(x = "ACS estimate",
       y = "",
       title = "Median home values by Census tract, 2015-2019 ACS")
Warning: Removed 201 rows containing non-finite values (`stat_density()`).

Ridgeline plots:

library(ggridges)

ggplot(housing_val, aes(x = estimate, y = county)) + 
  geom_density_ridges() + 
  theme_ridges() + 
  labs(x = "Median home value: 2016-2020 ACS estimate", 
       y = "") + 
  scale_x_continuous(labels = label_dollar(scale = .000001, suffix = "m"),
                     breaks = c(0, 500000, 1000000)) + 
  theme(axis.text.x = element_text(angle = 45))
Warning: Removed 201 rows containing non-finite values
(`stat_density_ridges()`).

8 Interactive visualization with plotly

library(plotly)

ggplotly(ca_pyramid)

9 Exercises

  • Choose a different variable in the ACS and/or a different location and create a margin of error visualization of your own.
  • Modify the population pyramid code to create a different, customized population pyramid. You can choose a different location (state or county), different colors/plot design, or some combination!